{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Z4jKiGPB__ke",
    "papermill": {
     "duration": 0.011184,
     "end_time": "2021-02-28T17:17:25.882205",
     "exception": false,
     "start_time": "2021-02-28T17:17:25.871021",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "# Application: Heterogeneous Effect of Sex on Wage Using Double Lasso\n",
    "\n",
    "We use US census data from the year 2015 to analyse the effect of gender and interaction effects of other variables with gender on wage jointly. The dependent variable is the logarithm of the wage, the target variable is *female* (in combination with other variables). All other variables denote some other socio-economic characteristics, e.g. marital status, education, and experience.  \n",
    "\n",
    "\n",
    "\n",
    "This analysis allows a closer look how the gender wage gap is related to other socio-economic variables.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "hfOxxCgf__kc"
   },
   "outputs": [],
   "source": [
    "# Import relevant packages\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from scipy.stats import norm\n",
    "from sklearn.linear_model import LinearRegression\n",
    "import patsy\n",
    "import warnings\n",
    "warnings.simplefilter('ignore')\n",
    "np.random.seed(1234)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "mn-ywPJb__kf"
   },
   "outputs": [],
   "source": [
    "file = \"https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/wage2015_subsample_inference.csv\"\n",
    "data = pd.read_csv(file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 320
    },
    "id": "JjAg9We1__kf",
    "outputId": "c8b80534-afe5-4ad4-b131-1ec3e8794321"
   },
   "outputs": [],
   "source": [
    "data.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "kvlmaeFU__kf"
   },
   "source": [
    "Define outcome and regressors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "FheFuCL1__kg",
    "outputId": "e25e22bc-b4f0-4f6d-9b31-e1dcb3631fde"
   },
   "outputs": [],
   "source": [
    "y = np.log(data['wage']).values\n",
    "Z = data.drop(['wage', 'lwage'], axis=1)\n",
    "Z.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "g2vGrvdJ__kg"
   },
   "source": [
    "## Feature Engineering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "JxlhqsDs__kg"
   },
   "source": [
    "Construct all our control variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "tjA6vrOf__kg"
   },
   "outputs": [],
   "source": [
    "# Ultra flexible controls of all pair-wise interactions (around 1k variables); un-comment to run this\n",
    "Zcontrols = patsy.dmatrix('0 + (shs+hsg+scl+clg+C(occ2)+C(ind2)+mw+so+we+exp1+exp2+exp3+exp4)**2',\n",
    "                          Z, return_type='dataframe')\n",
    "\n",
    "Zcontrols = Zcontrols - Zcontrols.mean(axis=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "hD36OSyb__kg"
   },
   "source": [
    "Construct all the variables that we will use to model heterogeneity of effect in a linear manner"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "DNPlH_rs__kh"
   },
   "outputs": [],
   "source": [
    "Zhet = patsy.dmatrix('0 + (shs+hsg+scl+clg+mw+so+we+exp1+exp2+exp3+exp4)',\n",
    "                     Z, return_type='dataframe')\n",
    "Zhet = Zhet - Zhet.mean(axis=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "woskUlaj__kh"
   },
   "source": [
    "Construct all interaction variables between sex and heterogeneity variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "kNsQotu6__kh"
   },
   "outputs": [],
   "source": [
    "Zhet['sex'] = Z['sex']\n",
    "Zinteractions = patsy.dmatrix('0 + sex + sex * (shs+hsg+scl+clg+mw+so+we+exp1+exp2+exp3+exp4)',\n",
    "                              Zhet, return_type='dataframe')\n",
    "interaction_cols = [c for c in Zinteractions.columns if c.startswith('sex')]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "-iMb_R87__ki"
   },
   "source": [
    "Put all the variables together"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "bzFOlDdw__ki",
    "outputId": "1457db9f-f2df-4ec3-c86a-548b3abad072"
   },
   "outputs": [],
   "source": [
    "X = pd.concat([Zinteractions, Zcontrols], axis=1)\n",
    "X.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dJDCw_Zc__ki"
   },
   "source": [
    "## Double Lasso for All Interactive Effects"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "vQvJnv-v__ki"
   },
   "source": [
    "We use \"plug-in\" tuning with a theoretically valid choice of penalty $\\lambda = 2 \\cdot c \\hat{\\sigma}\\sqrt{n}  \\Phi^{-1}(1-\\alpha/2p)$, where $c>1$ and $1-\\alpha$ is a confidence level, and $\\Phi^{-1}$ denotes the quantile function. This choice ensures that the Lasso predictor is well behaved under independence as long as appropriate penalty weights are used.\n",
    "\n",
    "In practice, many people choose to use cross-validation, which is perfectly fine for predictive tasks. However, when conducting inference, to make our analysis valid we will require cross-fitting in addition to cross-validation. As we have not yet discussed cross-fitting, we rely on this theoretically-driven penalty in order to allow for accurate inference in the upcoming notebooks.\n",
    "\n",
    "Note: In the book, we multiply by $\\sqrt{n}$. This is because there, Lasso minimizes the sum of errors. If you were using say sklearn's Lasso whose objective minimizes the average errors, you would instead divide by $\\sqrt{n}$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "6eV_tNex__km"
   },
   "source": [
    "To estimate lasso using the theoretically motivated penalty level, we just use the hdmpy package. To install it run\n",
    "```\n",
    "!pip install multiprocess\n",
    "!git clone https://github.com/maxhuppertz/hdmpy.git\n",
    "```\n",
    "\n",
    "You can run the cells below and then repeat the whole analysis above using the newly defined `lasso_model` variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "M9ihHr7DodEA",
    "outputId": "6a6bfc16-0c37-414b-b030-12f75f81a9d4"
   },
   "outputs": [],
   "source": [
    "!pip install multiprocess\n",
    "!git clone https://github.com/maxhuppertz/hdmpy.git"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "acQhFTC0__km"
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "sys.path.insert(1, \"./hdmpy\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "awA9uVQc__km"
   },
   "outputs": [],
   "source": [
    "# We wrap the package so that it has the familiar sklearn API\n",
    "import hdmpy\n",
    "from sklearn.base import BaseEstimator\n",
    "\n",
    "\n",
    "class RLasso(BaseEstimator):\n",
    "\n",
    "    def __init__(self, *, post=True):\n",
    "        self.post = post\n",
    "\n",
    "    def fit(self, X, y):\n",
    "        self.rlasso_ = hdmpy.rlasso(X, y, post=self.post)\n",
    "        return self\n",
    "\n",
    "    def predict(self, X):\n",
    "        return np.array(X) @ np.array(self.rlasso_.est['beta']).flatten() + np.array(self.rlasso_.est['intercept'])\n",
    "\n",
    "\n",
    "def lasso_model():\n",
    "    return RLasso(post=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 640
    },
    "id": "OBmzPIRZ__ks",
    "outputId": "14f4be52-7f71-4499-9dd3-6ba26bfd71e7"
   },
   "outputs": [],
   "source": [
    "alpha = {}\n",
    "res_y, res_D, epsilon = {}, {}, {}\n",
    "for c in interaction_cols:\n",
    "    print(f\"Double Lasso for target variable {c}\")\n",
    "    D = X[c].values\n",
    "    W = X.drop([c], axis=1)\n",
    "    res_y[c] = y - lasso_model().fit(W, y).predict(W)\n",
    "    res_D[c] = D - lasso_model().fit(W, D).predict(W)\n",
    "    final = LinearRegression(fit_intercept=False).fit(res_D[c].reshape(-1, 1), res_y[c])\n",
    "    epsilon[c] = res_y[c] - final.predict(res_D[c].reshape(-1, 1))\n",
    "    alpha[c] = [final.coef_[0]]\n",
    "\n",
    "# Calculate the covariance matrix of the estimated parameters\n",
    "V = np.zeros((len(interaction_cols), len(interaction_cols)))\n",
    "for it, c in enumerate(interaction_cols):\n",
    "    Jc = np.mean(res_D[c]**2)\n",
    "    for itp, cp in enumerate(interaction_cols):\n",
    "        Jcp = np.mean(res_D[cp]**2)\n",
    "        Sigma = np.mean(res_D[c] * epsilon[c] * epsilon[cp] * res_D[cp])\n",
    "        V[it, itp] = Sigma / (Jc * Jcp)\n",
    "\n",
    "# Calculate standard errors for each parameter\n",
    "n = X.shape[0]\n",
    "for it, c in enumerate(interaction_cols):\n",
    "    alpha[c] += [np.sqrt(V[it, it] / n)]\n",
    "\n",
    "# put all in a dataframe\n",
    "df = pd.DataFrame.from_dict(alpha, orient='index', columns=['point', 'stderr'])\n",
    "\n",
    "# Calculate and pointwise p-value\n",
    "summary = pd.DataFrame()\n",
    "summary['Estimate'] = df['point']\n",
    "summary['Std. Error'] = df['stderr']\n",
    "summary['p-value'] = norm.sf(np.abs(df['point'] / df['stderr']), loc=0, scale=1) * 2\n",
    "summary['ci_lower'] = df['point'] - 1.96 * df['stderr']\n",
    "summary['ci_upper'] = df['point'] + 1.96 * df['stderr']\n",
    "summary"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "DcW2A3JTMBiv"
   },
   "source": [
    "### Joint Confidence Intervals"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 426
    },
    "id": "IPClP_jTeZlr",
    "outputId": "a9309e70-0395-4896-c294-8e1d770740e1"
   },
   "outputs": [],
   "source": [
    "Drootinv = np.diagflat(1 / np.sqrt(np.diag(V)))\n",
    "scaledCov = Drootinv @ V @ Drootinv\n",
    "np.random.seed(123)\n",
    "U = np.random.multivariate_normal(np.zeros(scaledCov.shape[0]), scaledCov, size=10000)\n",
    "z = np.max(np.abs(U), axis=1)\n",
    "c = np.percentile(z, 95)\n",
    "\n",
    "summary = pd.DataFrame()\n",
    "summary['Estimate'] = df['point']\n",
    "summary['CI lower'] = df['point'] - c * df['stderr']\n",
    "summary['CI upper'] = df['point'] + c * df['stderr']\n",
    "summary"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  },
  "papermill": {
   "default_parameters": {},
   "duration": 89.365707,
   "end_time": "2021-02-28T17:18:51.003711",
   "environment_variables": {},
   "exception": null,
   "input_path": "__notebook__.ipynb",
   "output_path": "__notebook__.ipynb",
   "parameters": {},
   "start_time": "2021-02-28T17:17:21.638004",
   "version": "2.2.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
